Abstract

We consider a suboptimal solution path algorithm for the Support Vector
Machine. The solution path algorithm is an effective tool for solving
a sequence of parametrized optimization problems in machine learning.
The path of solutions provided by this algorithm is very accurate,
and the solutions satisfy the optimality conditions more strictly than those
of other SVM optimization algorithms. In many machine learning applications,
however, this strict optimality is often unnecessary, and it adversely affects
the computational efficiency. Our algorithm can generate the path of suboptimal
solutions within an arbitrary user-specified tolerance level. It allows us to
control the trade-off between the accuracy of the solution and the
computational cost. Moreover, we show that our suboptimal solutions can
be interpreted as the solution of an optimization problem perturbed from the
original one. We provide some theoretical analyses of our algorithm based
on this novel interpretation. The experimental results also demonstrate
the effectiveness of our algorithm.

1 Introduction 



Recently, the solution path algorithm (Efron et al., 2004; Hastie et al., 2004;
Cauwenberghs & Poggio, 2001) has been widely recognized as one of the effective
tools in machine learning. It can efficiently compute a sequence of solutions
of a parametrized optimization problem. This technique was originally developed
as parametric programming in the optimization community (Best, 1982).



In a class of parametric quadratic programs (QPs), the solution path is
represented as a piecewise-linear function of the problem parameters. If we
regard the regularization parameter of the Support Vector Machine (SVM) as
the problem parameter, the optimization problem for the SVM falls into
this class. Therefore, the SVM solutions are represented as piecewise-linear
functions of the regularization parameter.

The solutions of these parametric QPs are characterized by the active constraint
set in the current solution. The linearity of the path comes from the fact that the
Karush-Kuhn-Tucker (KKT) optimality conditions of these problems are
represented as a linear system defined by the current active set, while the
"piecewise-ness" is the consequence of the changes in the active set. The
piecewise-linear solution path algorithm repeatedly updates the linear system
and the active set. A point at which the active set changes is called a
breakpoint in the literature. The path of solutions generated by this algorithm
is very accurate, and the solutions satisfy the optimality conditions more
strictly than those of other algorithms.

Many machine learning problems, however, do not require strict optimality
of the solution. In fact, one of the popular SVM optimization algorithms, called
sequential minimal optimization (SMO) (Platt, 1999), is known to produce a
suboptimal (approximated) solution, where the tolerance to the optimality (degree
of approximation) can be specified by users. In many experimental studies, it
has been demonstrated that the generalization performances of these suboptimal
solutions are not significantly different from those of strictly optimal ones.

Therefore, the strict optimality of the solution path algorithm is often
unnecessary. Furthermore, it adversely affects the computational efficiency of the
algorithm. In fact, the solution path algorithm can be very slow when it
encounters a large number of (seemingly redundant) breakpoints. Although some
empirical studies suggest that the number of breakpoints grows linearly in the
input size, in the worst case, it can grow exponentially (Gartner et al., 2009).
Another difficulty is in starting the solution path algorithm from an
approximated solution, for example one obtained by SMO, because it does not
satisfy the strict optimality requirement.

In order to address these issues in the current solution path algorithm,
we introduce a suboptimal solution path algorithm. Our algorithm also
generates a piecewise-linear solution path, but the optimality tolerance
(approximation level) can be arbitrarily controlled by users. It allows us to
control the trade-off between the accuracy of the solution and the
computational cost.

The presented suboptimal solution path algorithm has the following proper- 
ties. 

• First, the algorithm can reduce the number of breakpoints (which is the
main computational bottleneck in the solution path algorithm) by allowing
multiple active set changes at one breakpoint. Although this modification
causes what is called the degeneracy problem, we provide an efficient and
accurate way to solve this issue. We empirically show that reducing the
number of breakpoints improves the computational efficiency.

• Second, the suboptimal solutions obtained by the algorithm can be inter- 
preted as the solution of a perturbed optimization problem from the original 
one. This novel interpretation provides several insights into the properties 
of our suboptimal solutions. We present some theoretical analyses of our 
suboptimal solutions using this interpretation. 

We also empirically investigate several practical properties of our approach.
Although our algorithm updates multiple active constraints at one breakpoint,
we observe that the overall changing patterns of the active sets are very similar to
those of the exact path. Moreover, despite its computational efficiency, the
generalization performance of our suboptimal path is comparable to that of the
conventional one.

To the best of our knowledge, there is no previous work on a suboptimal
solution path algorithm with controllable optimality tolerance that is
applicable to the standard SVM formulation^1. Although many authors mimic the
solution path by just repeating warm-starts on fine grid points (e.g.,
Friedman et al., 2007), this approach does not provide any guarantee about the
intermediate solutions between grid points. In this paper we focus our attention
on the solution path algorithm for the standard SVM, but the presented approach
can be applied to other problems in the aforementioned QP class.



2 Solution Path for Support Vector Machine 

In this section, we describe the solution path algorithm for regularization pa- 
rameters of Support Vector Machine (SVM). 

2.1 Support Vector Machine 

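Suppose we have a set of training data $\{(x_i, y_i)\}_{i=1}^{n}$, where
$x_i \in \mathcal{X} \subseteq \mathbb{R}^p$ is the input and $y_i \in \{-1, +1\}$
is the output class label. SVM learns a linear discriminant function
$f(x) = w^\top \Phi(x) + \alpha_0$ in a feature space $\mathcal{F}$, where
$\Phi : \mathcal{X} \to \mathcal{F}$ is a map from the input space $\mathcal{X}$
to the feature space $\mathcal{F}$, $w \in \mathcal{F}$ is a coefficient vector
and $\alpha_0 \in \mathbb{R}$ is a bias term.

In this paper, we consider the optimization problem of the following form:

$$\min_{w,\, \alpha_0,\, \{\xi_i\}_{i=1}^{n}} \ \frac{1}{2}\|w\|_2^2 + \sum_{i=1}^{n} C_i \xi_i \quad \text{s.t.} \quad y_i f(x_i) \ge 1 - \xi_i, \ \ \xi_i \ge 0, \ \ i = 1, \ldots, n, \tag{1}$$

where $\{C_i\}_{i=1}^{n}$ denotes regularization parameters. This formulation reduces to
the standard formulation of the SVM when all $C_i$'s are the same. Our discussion
in this paper holds for an arbitrary choice of $C_i$'s.
We formulate the dual problem of (1) as:

$$\max_{\alpha} \ -\frac{1}{2}\alpha^\top Q \alpha + \mathbf{1}^\top \alpha \quad \text{s.t.} \quad y^\top \alpha = 0, \ \ \mathbf{0} \le \alpha \le c, \tag{2}$$

where $\alpha = [\alpha_1, \ldots, \alpha_n]^\top$, $c = [C_1, \ldots, C_n]^\top$ and the $(i,j)$ element of
$Q \in \mathbb{R}^{n \times n}$ is $Q_{ij} = y_i y_j \Phi(x_i)^\top \Phi(x_j)$. Note that we use inequalities
between vectors as element-wise inequalities (i.e., $\alpha \le c \Leftrightarrow \alpha_i \le C_i$ for
$i = 1, \ldots, n$). Using the kernel function $K(x_i, x_j) = \Phi(x_i)^\top \Phi(x_j)$, the
discriminant function $f$ is represented as: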



$$f(x) = \sum_{i=1}^{n} \alpha_i y_i K(x, x_i) + \alpha_0.$$

^1 Giesen et al. (2010) proposed an approximated path algorithm with some
optimality guarantee that is applicable to the L2-SVM without bias term.

In what follows, the subscript by an index set such as $v_{\mathcal{I}}$ for a vector
$v = [v_1, \ldots, v_n]^\top$ indicates a sub-vector of $v$ whose elements are indexed by
$\mathcal{I} = \{i_1, \ldots, i_{|\mathcal{I}|}\}$. For example, for $v = [a, b, c]^\top$ and
$\mathcal{I} = \{1, 3\}$, $v_{\mathcal{I}} = [a, c]^\top$.

Similarly, the subscript by two index sets such as $M_{\mathcal{I}_1, \mathcal{I}_2}$ for a matrix
$M \in \mathbb{R}^{n \times n}$ denotes a sub-matrix whose rows and columns are indexed by
$\mathcal{I}_1$ and $\mathcal{I}_2$, respectively. The principal sub-matrix such as
$M_{\mathcal{I}, \mathcal{I}}$ is abbreviated as $M_{\mathcal{I}}$.
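The quantities of this subsection can be sketched numerically as follows. This is a minimal illustration, not the authors' implementation: it assumes an RBF kernel (the kernel used later in the experiments), and the function names `rbf_kernel`, `build_Q`, `decision_function` and `dual_objective` are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    # K(a, b) = exp(-gamma * ||a - b||^2) for all rows a of A and b of B.
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def build_Q(X, y, gamma):
    # Q_ij = y_i y_j K(x_i, x_j), the matrix of the dual problem (2).
    return (y[:, None] * y[None, :]) * rbf_kernel(X, X, gamma)

def decision_function(X_train, y, alpha, alpha0, X_new, gamma):
    # f(x) = sum_i alpha_i y_i K(x, x_i) + alpha_0
    return rbf_kernel(X_new, X_train, gamma) @ (alpha * y) + alpha0

def dual_objective(Q, alpha):
    # Objective of (2): -1/2 alpha^T Q alpha + 1^T alpha
    return -0.5 * alpha @ Q @ alpha + alpha.sum()
```

With these definitions, the margins on the training points satisfy $y_i f(x_i) = (Q\alpha)_i + y_i \alpha_0$, which is the form used in the path-following equations.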



2.2 Solution Path Algorithm for SVM 

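In this paper, we consider the solution path with respect to the regularization
parameter vector $c$. To follow the path, we parametrize $c$ in the following
form:

$$c^{(\theta)} = c^{(0)} + \theta d,$$

where $c^{(0)} = [C_1^{(0)}, \ldots, C_n^{(0)}]^\top$ is some initial parameter,
$d = [d_1, \ldots, d_n]^\top$ is a direction of the path and $\theta \ge 0$. We trace the
change of the optimal solution of the SVM when $\theta$ increases from 0.

Let $\{\alpha_i^{(\theta)}\}_{i=0}^{n}$ be the optimal parameters and $\{f_i^{(\theta)}\}_{i=1}^{n}$ be the
outputs $f(x_i)$ at $\theta$. The KKT optimality conditions are summarized as:

\begin{align}
y_i f_i^{(\theta)} &\ge 1, && \text{if } \alpha_i^{(\theta)} = 0, \tag{3a} \\
y_i f_i^{(\theta)} &= 1, && \text{if } 0 < \alpha_i^{(\theta)} < C_i^{(\theta)}, \tag{3b} \\
y_i f_i^{(\theta)} &\le 1, && \text{if } \alpha_i^{(\theta)} = C_i^{(\theta)}, \tag{3c} \\
y^\top \alpha^{(\theta)} &= 0. \tag{3d}
\end{align}

We separate data points into three index sets $\mathcal{M}, \mathcal{O}, \mathcal{I} \subseteq \{1, \ldots, n\}$ in such a
way that these sets satisfy

\begin{align}
i \in \mathcal{O} &\ \Rightarrow\ y_i f_i^{(\theta)} \ge 1, \ \alpha_i^{(\theta)} = 0, \tag{4a} \\
i \in \mathcal{M} &\ \Rightarrow\ y_i f_i^{(\theta)} = 1, \ \alpha_i^{(\theta)} \in [0, C_i^{(\theta)}], \tag{4b} \\
i \in \mathcal{I} &\ \Rightarrow\ y_i f_i^{(\theta)} \le 1, \ \alpha_i^{(\theta)} = C_i^{(\theta)}, \tag{4c}
\end{align}

and we denote these partitions altogether as $\pi := (\mathcal{O}, \mathcal{M}, \mathcal{I})$. If every data
point belongs to one of the three index sets and equality (3d) holds, the KKT
conditions (3) are satisfied. As long as these index sets are unchanged, we
have an analytical expression of the optimal solution in the form
$\alpha_i^{(\theta + \Delta\theta)} = \alpha_i^{(\theta)} + \Delta\theta \beta_i$, $i = 0, \ldots, n$, where $\Delta\theta$ is the
change of $\theta$ and $\{\beta_i\}_{i=0}^{n}$ are constants derived from sensitivity
analysis theory: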

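Theorem 1. Let $\pi = (\mathcal{O}, \mathcal{M}, \mathcal{I})$ be the partition at the optimal solution at $\theta$
and assume that

$$M := \begin{bmatrix} 0 & y_{\mathcal{M}}^\top \\ y_{\mathcal{M}} & Q_{\mathcal{M}} \end{bmatrix}$$

is non-singular^2. Then, as long as $\pi$ is unchanged, $\{\beta_i\}_{i=0}^{n}$ is given by

$$\begin{bmatrix} \beta_0 \\ \beta_{\mathcal{M}} \end{bmatrix} = -M^{-1} \begin{bmatrix} y_{\mathcal{I}}^\top \\ Q_{\mathcal{M},\mathcal{I}} \end{bmatrix} d_{\mathcal{I}}, \qquad \beta_{\mathcal{O}} = \mathbf{0}, \qquad \beta_{\mathcal{I}} = d_{\mathcal{I}}. \tag{5}$$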



The proof is in Appendix A. This theorem can be viewed as one of the
specific forms of the sensitivity theorem (Fiacco, 1976). It can be derived from
the KKT conditions (3), and similar properties are repeatedly used in various
solution path algorithms in machine learning (Cauwenberghs & Poggio, 2001;
Hastie et al., 2004).

Using the above theorem, we can update the solution by
$\alpha_i^{(\theta + \Delta\theta)} = \alpha_i^{(\theta)} + \Delta\theta \beta_i$ as long as $\pi$ is unchanged. However, if we
change $\theta$, the optimal partition $\pi$ could also change. Those change points
are called breakpoints. In the solution path algorithm, the optimality
conditions are always kept satisfied by precisely detecting the breakpoints and
updating $\pi$ properly.
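The direction computation of Theorem 1 can be sketched as below. This is only an illustration under stated assumptions: it solves the linear system (5) with a dense solver rather than the incremental factor updates a practical path algorithm would use, and the names `path_direction`, `idx_M`, `idx_O`, `idx_I` are hypothetical.

```python
import numpy as np

def path_direction(Q, y, d, idx_M, idx_O, idx_I):
    """Return (beta0, beta, g) for a fixed partition, following (5)."""
    n = len(y)
    beta = np.zeros(n)                 # beta_O = 0 by default
    beta[idx_I] = d[idx_I]             # beta_I = d_I
    m = len(idx_M)
    # Assemble M = [[0, y_M^T], [y_M, Q_M]].
    M = np.zeros((m + 1, m + 1))
    M[0, 1:] = y[idx_M]
    M[1:, 0] = y[idx_M]
    M[1:, 1:] = Q[np.ix_(idx_M, idx_M)]
    # Right-hand side: -[y_I^T d_I ; Q_{M,I} d_I].
    rhs = -np.concatenate(([y[idx_I] @ d[idx_I]],
                           Q[np.ix_(idx_M, idx_I)] @ d[idx_I]))
    sol = np.linalg.solve(M, rhs)      # assumes M is non-singular
    beta0 = sol[0]
    beta[idx_M] = sol[1:]
    # g is the rate of change of the margins y_i f_i.
    g = Q @ beta + y * beta0
    return beta0, beta, g
```

By construction, the returned direction keeps the margins on $\mathcal{M}$ fixed ($g_{\mathcal{M}} = 0$) and preserves the equality constraint ($y^\top \beta = 0$).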



3 Suboptimal Solution Path 

In this section, we develop a suboptimal solution path algorithm for the SVM,
where the tolerance to the optimality conditions can be arbitrarily controlled
by users. The basic idea is to relax the KKT optimality conditions and allow
multiple data points to move among the sets of the partition $\pi$ at the same
time. This reduces the number of breakpoints and thereby improves the
computational efficiency, allowing us to control the balance between the
accuracy of the solution and the computational cost.

3.1 Approximate Optimality Conditions 

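First, we relax the conditions (4) as

\begin{align}
i \in \mathcal{O} &\ \Rightarrow\ y_i f_i^{(\theta)} \ge 1 - \varepsilon_1, \ \alpha_i^{(\theta)} \in [-\varepsilon_2, 0], \tag{6a} \\
i \in \mathcal{M} &\ \Rightarrow\ y_i f_i^{(\theta)} \in [1 - \varepsilon_1, 1 + \varepsilon_1], \ \alpha_i^{(\theta)} \in [-\varepsilon_2, C_i^{(\theta)} + \varepsilon_2], \tag{6b} \\
i \in \mathcal{I} &\ \Rightarrow\ y_i f_i^{(\theta)} \le 1 + \varepsilon_1, \ \alpha_i^{(\theta)} \in [C_i^{(\theta)}, C_i^{(\theta)} + \varepsilon_2], \tag{6c}
\end{align}

where $\varepsilon_1 \ge 0$ and $\varepsilon_2 \ge 0$ specify the degree of approximation. If we set
$\varepsilon_1 = \varepsilon_2 = 0$, these conditions reduce to (4).

Our algorithm changes $\theta$ while keeping the above conditions (6) satisfied. Let
$\theta_0 = 0$ be the initial value of $\theta$ and the non-decreasing sequence
$\theta_0 \le \theta_1 \le \theta_2 \le \ldots$ be the breakpoints. Suppose we are currently at $\theta_k$.
The next breakpoint $\theta_{k+1}$ is characterized as the point beyond which we cannot
increase $\theta$ without violating the conditions (6) or changing the index sets $\pi$.

^2 The invertibility of the matrix $M$ is assured if and only if the submatrix
$Q_{\mathcal{M}}$ is positive definite in the subspace $\{z \in \mathbb{R}^{|\mathcal{M}|} \mid y_{\mathcal{M}}^\top z = 0\}$.

If we set $\{\beta_i\}_{i=0}^{n}$ by (5), then $y_i f_i^{(\theta)}$, $i \in \mathcal{M}$, and $\alpha_i^{(\theta)}$,
$i \in \mathcal{O}$, are constant, while $\alpha_i^{(\theta)}$, $i \in \mathcal{I}$, follows its bound $C_i^{(\theta)}$.
To increase $\theta$ from $\theta_k$, we therefore only need to check the following
inequalities:

\begin{align*}
y_i f_i^{(\theta_k)} + \Delta\theta\, g_i &\ge 1 - \varepsilon_1, && i \in \mathcal{O}, \\
\alpha_i^{(\theta_k)} + \Delta\theta\, \beta_i &\in [-\varepsilon_2, \ C_i^{(\theta_k)} + \Delta\theta\, d_i + \varepsilon_2], && i \in \mathcal{M}, \\
y_i f_i^{(\theta_k)} + \Delta\theta\, g_i &\le 1 + \varepsilon_1, && i \in \mathcal{I},
\end{align*}

where $g_i$ is the change of the output $y_i f_i$, which is defined by
$g = Q\beta + y\beta_0$. We want to know the maximum $\Delta\theta$ which satisfies all of the
above inequalities. We can easily calculate the maximum $\Delta\theta$ for each
inequality as follows:

\begin{align*}
\Theta_{\mathcal{O}} &= \{ (1 - \varepsilon_1 - y_i f_i^{(\theta_k)}) / g_i \mid i \in \mathcal{O}, \ g_i < 0 \}, \\
\Theta_{\mathcal{M}_\ell} &= \{ -(\alpha_i^{(\theta_k)} + \varepsilon_2) / \beta_i \mid i \in \mathcal{M}, \ \beta_i < 0 \}, \\
\Theta_{\mathcal{M}_u} &= \{ (C_i^{(\theta_k)} + \varepsilon_2 - \alpha_i^{(\theta_k)}) / (\beta_i - d_i) \mid i \in \mathcal{M}, \ \beta_i > d_i \}, \\
\Theta_{\mathcal{I}} &= \{ (1 + \varepsilon_1 - y_i f_i^{(\theta_k)}) / g_i \mid i \in \mathcal{I}, \ g_i > 0 \}.
\end{align*}

Since we have to keep all of the inequalities satisfied, we take the minimum of
these values: $\Delta\theta = \min \Theta$, where
$\Theta = \{\Theta_{\mathcal{O}}, \Theta_{\mathcal{M}_\ell}, \Theta_{\mathcal{M}_u}, \Theta_{\mathcal{I}}\}$. Then we can find

$$\theta_{k+1} = \theta_k + \Delta\theta.$$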
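The step-length rule above can be sketched as follows. This is an illustrative sketch, not the authors' code: the function name `max_step` and its argument layout are hypothetical, and `yf` denotes the vector of margins $y_i f_i^{(\theta_k)}$.

```python
import numpy as np

def max_step(alpha, yf, c, beta, g, d, idx_O, idx_M, idx_I, eps1, eps2):
    """Largest step that keeps the relaxed conditions (6) satisfied."""
    O = np.asarray(idx_O, dtype=int)
    M = np.asarray(idx_M, dtype=int)
    I = np.asarray(idx_I, dtype=int)
    cands = [np.inf]                      # no binding constraint -> unbounded step
    # O: y_i f_i + step * g_i >= 1 - eps1 can only bind when g_i < 0.
    o = O[g[O] < 0]
    cands.extend((1.0 - eps1 - yf[o]) / g[o])
    # M, lower bound: alpha_i + step * beta_i >= -eps2 binds when beta_i < 0.
    ml = M[beta[M] < 0]
    cands.extend(-(alpha[ml] + eps2) / beta[ml])
    # M, upper bound: alpha_i + step * beta_i <= C_i + step * d_i + eps2
    # binds when beta_i > d_i.
    mu = M[beta[M] > d[M]]
    cands.extend((c[mu] + eps2 - alpha[mu]) / (beta[mu] - d[mu]))
    # I: y_i f_i + step * g_i <= 1 + eps1 can only bind when g_i > 0.
    it = I[g[I] > 0]
    cands.extend((1.0 + eps1 - yf[it]) / g[it])
    return min(cands)
```

Larger tolerances enlarge every candidate, so the step to the next breakpoint can only grow as $\varepsilon_1$ and $\varepsilon_2$ increase, which is the mechanism behind the reduced number of breakpoints.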

Once we have detected $\theta_{k+1}$, it is necessary to update $\pi$ to go beyond the
breakpoint. Conventional solution path algorithms allow only one data point
to move between the sets of the partition $\pi$ at each breakpoint. For example,
when $\alpha_i$, $i \in \mathcal{M}$, reaches 0, the algorithm transfers the index $i$ from
$\mathcal{M}$ to $\mathcal{O}$ (Figure 1(a)). In our algorithm, multiple data points are allowed to
move between the sets of $\pi$ at the same time in order to reduce the number of
breakpoints.



3.2 Update Index Sets 



At a breakpoint, our algorithm handles all the data points that violate the strict
inequality conditions (4) rather than the relaxed ones (6) (Figure 1(b)). This
situation can be interpreted as what is called degeneracy in parametric
programming (Ritter, 1984). Here, degeneracy means that multiple constraints hit
the boundaries of their inequalities simultaneously. Although degenerate
situations rarely happen in conventional solution path algorithms, this is not
the case in ours. The simultaneous change of multiple data points inevitably
brings about "highly" degenerate situations involving many constraints. In the
degenerate case, we have a problem called cycling. For example, if we move two
indices $i$ and $j$ from $\mathcal{M}$ to $\mathcal{O}$ at the breakpoint, then both or either of
them may immediately return to $\mathcal{M}$. We therefore need to design an update
strategy for $\pi$ that can circumvent cycling.

The degeneracy can be handled by several approaches which are known in
the parametric programming literature. Ritter (1984) showed that cycling
can be dealt with through the well-known Bland's minimum index rule in
linear programming (Bland, 1977). However, in the worst case, this approach
must go through all the possible patterns of the next $\pi$. Since we need to
evaluate $\{\beta_i\}_{i=0}^{n}$ in each iteration, a large number of iterations may
cause additional computational cost. In this paper, we provide a more
essential solution to this problem based on (Berkelaar et al., 1997).

(a) Exact path    (b) Suboptimal path

Figure 1: An illustrative example of the breakpoint. The points of the vertical
dashed lines are breakpoints. (a) At the breakpoint in the upper plot, $\alpha_i$,
$i \in \mathcal{M}$, reaches 0. Since the index $i$ is transferred from $\mathcal{M}$ to $\mathcal{O}$,
$\alpha_i = 0$ on the right side of the vertical line. In the lower plot,
$y_i f_i^{(\theta)} = 1$ on the left side of the vertical line and $y_i f_i^{(\theta)} \ge 1$ on the
right side of the vertical line. At the breakpoint, the data point $i$ satisfies
both of the optimality conditions (4b) and (4a) for $\mathcal{M}$ and $\mathcal{O}$, respectively.
(b) At the breakpoint in the upper plot, one of $\alpha_i$, $i \in \mathcal{M}$, reaches
$-\varepsilon_2$. In the lower plot, both of the two lines are in
$[1 - \varepsilon_1, 1 + \varepsilon_1]$. In this case, these two points satisfy both of the
optimality conditions (6b) and (6a) for $\mathcal{M}$ and $\mathcal{O}$, respectively. It does not
necessarily mean that these two data points should move to $\mathcal{O}$: either of them
may stay in $\mathcal{M}$ even after the breakpoint. This situation is called degeneracy
in the parametric programming literature.



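Suppose we are currently on the breakpoint $\theta_k$. Let

\begin{align*}
\mathcal{B}_O &= \{ i \mid \alpha_i^{(\theta_k)} \le 0, \ \beta_i \le 0, \ i \in \mathcal{M} \} \cup \{ i \mid y_i f_i^{(\theta_k)} \le 1, \ g_i \le 0, \ i \in \mathcal{O} \}, \\
\mathcal{B}_I &= \{ i \mid \alpha_i^{(\theta_k)} \ge C_i^{(\theta_k)}, \ \beta_i \ge d_i, \ i \in \mathcal{M} \} \cup \{ i \mid y_i f_i^{(\theta_k)} \ge 1, \ g_i \ge 0, \ i \in \mathcal{I} \}.
\end{align*}

$\mathcal{B}_O$ is the set of indices which satisfy the conditions (6a) and (6b) for being a
member of $\mathcal{M}$ and $\mathcal{O}$ simultaneously at $\theta_k$. Similarly, indices in $\mathcal{B}_I$
satisfy the conditions (6b) and (6c) for being a member of $\mathcal{M}$ and $\mathcal{I}$ at
$\theta_k$. Moreover, let us define the union of these two sets as

$$\mathcal{B} = \mathcal{B}_O \cup \mathcal{B}_I.$$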
Our task is to partition these indices into $\mathcal{O}$, $\mathcal{M}$ and $\mathcal{I}$ correctly so
that it does not cause cycling.

In our formulation, due to the approximation by $\varepsilon_1$ and $\varepsilon_2$, the cycling may
not occur at $\Delta\theta = 0$ immediately. For example, suppose that $i$ moves to $\mathcal{M}$
from $\mathcal{O}$ and its parameter is $\alpha_i = 0$. In the next iteration, we need to check
$\alpha_i + \Delta\theta \beta_i \ge -\varepsilon_2$. If $\beta_i < 0$, then we obtain
$\Delta\theta \le -\varepsilon_2 / \beta_i$, whose right-hand side is positive. Although this
allows $\Delta\theta > 0$, the index $i$ may return to $\mathcal{O}$. This situation can also be
considered as cycling.

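Let $\pi_k = (\mathcal{O}_k, \mathcal{M}_k, \mathcal{I}_k)$ be $\pi$ in $[\theta_k, \theta_{k+1}]$. At $\theta_{k+1}$, if and
only if the cycling does not occur, it can be shown that the following
conditions hold:

\begin{align}
\beta_i \ge 0, \ g_i = 0, &\quad \text{for } i \in \mathcal{M}_{k+1} \cap \mathcal{B}_O, \tag{7a} \\
\beta_i = 0, \ g_i \ge 0, &\quad \text{for } i \in \mathcal{O}_{k+1} \cap \mathcal{B}_O, \tag{7b} \\
\beta_i \le d_i, \ g_i = 0, &\quad \text{for } i \in \mathcal{M}_{k+1} \cap \mathcal{B}_I, \tag{7c} \\
\beta_i = d_i, \ g_i \le 0, &\quad \text{for } i \in \mathcal{I}_{k+1} \cap \mathcal{B}_I. \tag{7d}
\end{align}

Although $\beta$ and $g$ are usually calculated using $\pi$, our approach allows us
to calculate $\beta$ and $g$ without knowing $\pi$ so that they satisfy the above
conditions. If the gradient $\beta$, which is defined in (5), satisfies the following
conditions, we can find the next partition $\pi_{k+1}$ that satisfies (7). The
conditions are:

\begin{align}
g_i \beta_i = 0, \ \ g_i \ge 0, \ \ \beta_i \ge 0, &\quad i \in \mathcal{B}_O, \notag \\
g_i (d_i - \beta_i) = 0, \ \ g_i \le 0, \ \ \beta_i \le d_i, &\quad i \in \mathcal{B}_I. \tag{8}
\end{align}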

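If we know such $\beta$ and $g$, using the following update rule, we can determine
$\pi_{k+1}$ as:

\begin{align}
\mathcal{M}_{k+1} &= \tilde{\mathcal{M}}_{k+1} \cup \{ i \mid \beta_i \ge 0, \ g_i = 0, \ i \in \mathcal{B}_O \} \cup \{ i \mid \beta_i \le d_i, \ g_i = 0, \ i \in \mathcal{B}_I \}, \notag \\
\mathcal{O}_{k+1} &= \tilde{\mathcal{O}}_{k+1} \cup \{ i \mid \beta_i = 0, \ g_i > 0, \ i \in \mathcal{B}_O \}, \tag{9} \\
\mathcal{I}_{k+1} &= \tilde{\mathcal{I}}_{k+1} \cup \{ i \mid \beta_i = d_i, \ g_i < 0, \ i \in \mathcal{B}_I \}, \notag
\end{align}

where $\tilde{\mathcal{O}}_{k+1} = \mathcal{O}_k \setminus \mathcal{B}$, $\tilde{\mathcal{M}}_{k+1} = \mathcal{M}_k \setminus \mathcal{B}$ and
$\tilde{\mathcal{I}}_{k+1} = \mathcal{I}_k \setminus \mathcal{B}$.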
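The update rule can be sketched with plain Python sets as below. This is a hedged illustration rather than the authors' implementation: it assumes that $\beta$ and $g$ already satisfy the complementarity conditions on $\mathcal{B}_O$ and $\mathcal{B}_I$, so that the sign of $g_i$ alone decides the destination, and the numerical tolerance `tol` and the name `update_partition` are assumptions introduced here.

```python
def update_partition(O_k, M_k, I_k, B_O, B_I, g, tol=1e-10):
    """Assign every degenerate index in B_O and B_I to exactly one set."""
    B = set(B_O) | set(B_I)
    # Indices not involved in the degeneracy keep their current set.
    O_next = set(O_k) - B
    M_next = set(M_k) - B
    I_next = set(I_k) - B
    for i in B_O:
        if g[i] > tol:
            O_next.add(i)      # g_i > 0 (and beta_i = 0): margin grows, alpha stays at 0
        else:
            M_next.add(i)      # g_i = 0 (and beta_i >= 0): stays on the margin
    for i in B_I:
        if g[i] < -tol:
            I_next.add(i)      # g_i < 0 (and beta_i = d_i): alpha follows its upper bound
        else:
            M_next.add(i)      # g_i = 0 (and beta_i <= d_i): stays on the margin
    return O_next, M_next, I_next
```

Because each index of $\mathcal{B}$ lands in exactly one set, the three returned sets again form a partition of $\{1, \ldots, n\}$.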

Remark 1. By definition, the update rule (9) guarantees that the non-cycling
conditions (7) hold.

To use (9), we need $\beta$ and $g$ which satisfy (8). The following theorem shows
that they can be obtained from a quadratic programming problem (QP):

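Theorem 2. Let $\hat{\beta}_0$, $\hat{\beta}$ and $\hat{g}$ be the optimal solutions of the following QP
problem:

$$\begin{aligned}
\min_{\hat{\beta}_0, \hat{\beta}, \hat{g}} \ & \sum_{i \in \mathcal{B}_O} \hat{g}_i \hat{\beta}_i + \sum_{i \in \mathcal{B}_I} \hat{g}_i (\hat{\beta}_i - d_i) \\
\text{s.t.} \ & \hat{g}_{\mathcal{B}_O} \ge 0, \ \hat{\beta}_{\mathcal{B}_O} \ge 0, \ \hat{g}_{\mathcal{B}_I} \le 0, \ \hat{\beta}_{\mathcal{B}_I} \le d_{\mathcal{B}_I}, \\
& \hat{g}_{\tilde{\mathcal{M}}_{k+1}} = 0, \ \hat{\beta}_{\tilde{\mathcal{O}}_{k+1}} = 0, \ \hat{\beta}_{\tilde{\mathcal{I}}_{k+1}} = d_{\tilde{\mathcal{I}}_{k+1}}, \\
& y^\top \hat{\beta} = 0, \ \hat{g} = Q\hat{\beta} + y\hat{\beta}_0,
\end{aligned} \tag{10}$$

and $\pi_{k+1}$ is determined by (9) using $\hat{\beta}$ and $\hat{g}$. Then $\hat{\beta}_0$, $\hat{\beta}$ and $\hat{g}$
satisfy (8) and they are equal to the gradient $\beta_0$, $\beta$ and $g$, respectively.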

Although the detailed proof is in Appendix B, we can provide a clear
interpretation of this optimization problem. The objective function and
inequality constraints correspond to (8) and the other constraints correspond to
the linear system (5). It can be shown that the optimal value of the objective
function is 0. Given the non-negativity of each term in the objective, we see
that (8) holds (see Appendix B for details).

The optimization problem (10) has $2n + 1$ variables and $2|\mathcal{B}| + 2n + 1$
constraints. However, we can reduce these sizes to $|\mathcal{B}|$ variables and
$2|\mathcal{B}|$ constraints by arranging the equality constraints^3. The detailed
formulation of the reduced problem is in Appendix C. If $|\mathcal{B}|$ is large,
solving (10) may incur a large computational cost. To avoid this, we set an
upper bound $B$ on the number of elements of $\mathcal{B}$. In the case of
$|\mathcal{B}| > B$, we choose the top $B$ elements of the original $\mathcal{B}$ in increasing
order of $\Theta$ as the elements of $\mathcal{B}$.
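The truncation of the degenerate set can be sketched as follows. This is a minimal illustration assuming each candidate index comes with its step-length value from $\Theta$; the function name `truncate_candidates` and the argument `B_max` (standing for the upper bound $B$) are hypothetical.

```python
import numpy as np

def truncate_candidates(indices, theta_values, B_max):
    """Keep at most B_max indices, those with the smallest step-length values."""
    # Stable sort so that ties keep their original order.
    order = np.argsort(np.asarray(theta_values), kind="stable")[:B_max]
    return [indices[j] for j in order]
```

With a constant `B_max`, the QP solved at each breakpoint has a size that does not grow with $n$.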



3.3 Algorithm and Computational Complexity 

Here, we summarize our algorithm and analyze its computational complexity.
At the $k$-th breakpoint, our algorithm performs the following procedure:

step1 Using $\pi_k$, calculate $\beta_0$, $\beta$ and $g$ by (5);

step2 Calculate the next breakpoint $\theta_{k+1}$ and update $\alpha_0^{(\theta)}$, $\alpha^{(\theta)}$, $c^{(\theta)}$;

step3 Solve (10) and calculate $\pi_{k+1}$ by (9).

In step1, we need to solve the linear system (5). In conventional solution path
algorithms, we can update it using a rank-one update of an inverse matrix or a
Cholesky factor from the previous iteration in $O(|\mathcal{M}|^2)$ computations. In our
case, we need a rank-$m$ update at each breakpoint, where $1 \le m \le B$. When we
set $B$ as some small constant, the computational cost still remains
$O(|\mathcal{M}|^2)$. Including the other processes in this step, the computational cost
becomes $O(n|\mathcal{M}|)$. In step2, given $\beta$ and $g$, we can calculate all the
possible step lengths $\Theta$ in $O(n)$. In step3, since the optimization problem
(10) becomes a convex QP problem with $|\mathcal{B}|$ variables, it can be solved
efficiently by standard QP solvers when $|\mathcal{B}|$ is relatively small compared to
$n$. When we set $B$ as some constant, the time for solving this optimization
problem is then independent of $n$.

Putting it all together, in the case of constant $B$, the computational cost of
each breakpoint is $O(n|\mathcal{M}|)$. This is the same as the conventional solution
path algorithm. However, as we will see later in the experiments, our algorithm
drastically reduces the number of breakpoints, especially when we use large
$\varepsilon_1$ and $\varepsilon_2$.

^3 In the case of $|\tilde{\mathcal{M}}_{k+1}| = 0$, the reduced problem has $|\mathcal{B}| + 1$ variables
and $2|\mathcal{B}| + 1$ constraints.

4 Analysis 

In this section, we provide some theoretical analyses of our suboptimal solution 
path. 

4.1 Interpretation as Perturbed Problem 

An interesting property of our approach is that the solutions always keep the 
optimality of an optimization problem which is slightly perturbed from the 
original one. The following theorem gives the formulation of the perturbed 
problem: 

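Theorem 3. Every solution $\alpha^{(\theta)}$ in the suboptimal solution path is the optimal
solution of the following optimization problem:

$$\max_{\alpha} \ -\frac{1}{2}\alpha^\top Q \alpha + (\mathbf{1} + p)^\top \alpha \quad \text{s.t.} \quad y^\top \alpha = 0, \ \ -q \le \alpha \le c^{(\theta)} + q, \tag{11}$$

where the perturbation parameters $p, q \in \mathbb{R}^n$ are in $-\varepsilon_1 \mathbf{1} \le p \le \varepsilon_1 \mathbf{1}$
and $\mathbf{0} \le q \le \varepsilon_2 \mathbf{1}$, respectively.

Proof. Let $\xi^{+}, \xi^{-} \in \mathbb{R}^n_{+}$ and $\kappa \in \mathbb{R}$ be the Lagrange multipliers. The
Lagrangian is

$$L = -\frac{1}{2}\alpha^\top Q \alpha + (\mathbf{1} + p)^\top \alpha + (\alpha + q)^\top \xi^{-} + (c^{(\theta)} + q - \alpha)^\top \xi^{+} + \kappa y^\top \alpha,$$

and the KKT conditions are

\begin{align}
\frac{\partial L}{\partial \alpha} = -Q\alpha + \mathbf{1} + p + \xi^{-} - \xi^{+} + \kappa y &= 0, \tag{12a} \\
\xi^{+}, \ \xi^{-} &\ge 0, \tag{12b} \\
\xi_i^{-} (\alpha_i + q_i) &= 0, \quad i = 1, \ldots, n, \tag{12c} \\
\xi_i^{+} (C_i^{(\theta)} + q_i - \alpha_i) &= 0, \quad i = 1, \ldots, n, \tag{12d} \\
-q \le \alpha &\le c^{(\theta)} + q, \tag{12e} \\
y^\top \alpha &= 0. \tag{12f}
\end{align}

Substituting $\alpha = \alpha^{(\theta)}$ and $\kappa = -\alpha_0^{(\theta)}$, the $i$-th element of (12a) can be
written as $y_i f_i^{(\theta)} = 1 + p_i + \xi_i^{-} - \xi_i^{+}$. Considering this and the conditions
of the suboptimal solution (6), there exist $p_i \in [-\varepsilon_1, \varepsilon_1]$ and $\xi_i^{\pm}$ which
satisfy $\xi_i^{+} = \xi_i^{-} = 0$ for $i \in \mathcal{M}$, $\xi_i^{+} = 0$, $\xi_i^{-} \ge 0$ for $i \in \mathcal{O}$, and
$\xi_i^{+} \ge 0$, $\xi_i^{-} = 0$ for $i \in \mathcal{I}$. These $\xi_i^{\pm}$'s satisfy the non-negativity
constraint (12b).

The complementary conditions (12c) and (12d) for $i \in \mathcal{M}$ hold from
$\xi_i^{+} = \xi_i^{-} = 0$. For $i \in \mathcal{O}$, since $\xi_i^{+} = 0$, we do not have to check (12d). In
this case, if we set $q_i = -\alpha_i^{(\theta)} \in [0, \varepsilon_2]$, then (12c) holds. It can be shown
in a similar way that (12c) and (12d) hold for $i \in \mathcal{I}$.

Our suboptimal solution path algorithm always keeps the equality constraint of
the dual (12f) and the box constraint (12e) satisfied. Therefore, we see that
(12) holds. □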



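The problem (11) can be interpreted as the dual problem of the following
form of the SVM:

$$\min_{w, \alpha_0} \ \frac{1}{2} w^\top w + \sum_{i=1}^{n} \ell_i(1 + p_i - y_i f_i), \tag{13}$$

where

$$\ell_i(\xi_i) = \begin{cases} (C_i^{(\theta)} + q_i)\, \xi_i, & \text{for } \xi_i \ge 0, \\ -q_i \xi_i, & \text{for } \xi_i < 0, \end{cases}$$

is a loss function. We see that the perturbations appear in the loss term.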
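The shape of this perturbed loss can be checked with a few lines of code. The sketch below is only illustrative, and the function name `perturbed_loss` is hypothetical; it evaluates $\ell_i$ for scalar inputs.

```python
def perturbed_loss(xi, C, q):
    """Piecewise-linear loss of (13): slope C + q for xi >= 0, slope -q for xi < 0."""
    if xi >= 0:
        return (C + q) * xi
    return -q * xi
```

For $q = 0$ the function reduces to the usual hinge-type term $C \max(0, \xi)$, and for $q > 0$ it also puts a small penalty on negative $\xi$. Its value equals the maximum of $a \xi$ over $a \in [-q, C + q]$, which mirrors the box constraint of (11).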



4.2 Error Analysis 

We have shown that the solution of the suboptimal solution path can be
interpreted as the optimal solution of the perturbed problem (11). Here, we
consider how close the optimal solution of the perturbed problem is to the
solution of the original problem in terms of the optimal objective value.

Let $D(\alpha)$ and $\tilde{D}(\alpha)$ be the dual objective functions of the original
optimization problem (2) and the perturbed problem (11), respectively, so that
$\tilde{D}(\alpha) = D(\alpha) + p^\top \alpha$. Since $\tilde{D}$ is concave, its first-order affine bound
at $\alpha^*$ gives

$$\tilde{D}(\alpha) \le D(\alpha^*) + p^\top \alpha^* + (-Q\alpha^* + \mathbf{1} + p)^\top (\alpha - \alpha^*),$$

where $\alpha^*$ is the optimal solution of the original problem. Let $\tilde{\alpha}$ be the
optimal solution of the perturbed problem. Substituting $\alpha = \tilde{\alpha}$ and adding
$-\alpha_0^* y^\top (\tilde{\alpha} - \alpha^*) = 0$ to the right-hand side, we obtain

$$\tilde{D}(\tilde{\alpha}) - D(\alpha^*) \le p^\top \alpha^* + (\xi^* + p)^\top (\tilde{\alpha} - \alpha^*), \tag{14}$$

where $\xi^* = -Q\alpha^* - y\alpha_0^* + \mathbf{1}$. Note that $\xi^*_{\mathcal{I}} \ge 0$, $\xi^*_{\mathcal{M}} = 0$ and
$\xi^*_{\mathcal{O}} \le 0$, where $\mathcal{I}$, $\mathcal{M}$ and $\mathcal{O}$ represent the optimal partition of the
original problem (2). Here, we define
$\tilde{\mathcal{I}} = \{ i \mid \xi_i^* + p_i > 0, \ i \in \mathcal{I} \}$,
$\tilde{\mathcal{O}} = \{ i \mid \xi_i^* + p_i < 0, \ i \in \mathcal{O} \}$ and
$\tilde{\mathcal{M}} = \{1, \ldots, n\} \setminus (\tilde{\mathcal{O}} \cup \tilde{\mathcal{I}})$. From the right-hand side of (14), we
obtain an upper bound on $\tilde{D}(\tilde{\alpha}) - D(\alpha^*)$ expressed through $p_i$, $q_i$,
$\xi_i^*$ and $C_i$.

From the duality theorem, this also bounds the difference of the primal objective
value. Compared with the original objective function (1), this bound can be
considered small when $p_i$ and $q_i$ are small enough compared to $\xi_i^*$ and $C_i$.
From this viewpoint, this bound gives a theoretical justification for our
intuitive interpretation. The bound for $D(\alpha^*) - \tilde{D}(\tilde{\alpha})$ can also be derived
in the same manner.

Table 1: Data sets

Data set       n      p
internet ad    2359   1558
spam           4601   57
a5a            6414   123
w5a            9888   300



5 Experiments 

In this section, we illustrate the empirical performance of the proposed approach
compared to the conventional exact solution path algorithm. Our task is to trace
the solution path from c'--^^ ~ 10~^/ri x 1 to c'^^ = lO^/n x 1. Since all the 
elements of $c^{(\theta)}$ take the same value in this case, we sometimes refer to this
common value as $C^{(\theta)}$ (i.e., $c^{(\theta)} = C^{(\theta)} \times \mathbf{1}$). The RBF kernel
$K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$ is used with $\gamma = 1/p$, where $p$ is the number
of features. To circumvent possible numerical instability in the solution path,
we add a small
positive constant 10~^ to the diagonals of the matrix Q. 

Let $\varepsilon \ge 0$ be a parameter which controls the degree of approximation. In
this paper, using $\varepsilon$, we set $\varepsilon_1 = \varepsilon$ and $\varepsilon_2 = \varepsilon \times C^{(\theta_k)}$, where
$\theta_k$ is the previous breakpoint. That is, $\varepsilon_2$ is set on a scale relative to
$C^{(\theta)}$. Table 1 lists the statistics of the data sets. These data sets are
available from the LIBSVM site (Chang & Lin, 2001) and the UCI data repository
(Asuncion & Newman, 2007). We randomly sampled $n$ data points from the original
data set 10 times (we set $n$ to be approximately 80% of the original number of
data points in the table). The input $x$ of each data set is linearly scaled to
$[0, 1]^p$.
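The preprocessing and tolerance settings described above can be sketched as follows. This is a hedged illustration: the text only states that inputs are linearly scaled to $[0, 1]^p$, so the per-feature min-max scaling used here is an assumption, and the names `scale_to_unit_interval` and `tolerances` are hypothetical.

```python
import numpy as np

def scale_to_unit_interval(X):
    """Per-feature min-max scaling to [0, 1] (assumed form of the linear scaling)."""
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span = np.where(span > 0, span, 1.0)   # constant features are mapped to 0
    return (X - lo) / span

def tolerances(eps, C_theta_k):
    """eps1 = eps and eps2 = eps * C at the previous breakpoint."""
    return eps, eps * C_theta_k
```

Setting `eps2` relative to the current value of $C$ keeps the allowed violation of the box constraint proportional to the size of the box as the path moves toward larger regularization parameters.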

Figure 2 shows the comparison of the CPU time and the number of breakpoints.
To make a fair comparison, the initialization is not included in the CPU
time. In these results, we set $B = 10$ and we investigated the relationship
between the computational cost and the degree of approximation by examining
several settings of $\varepsilon \in \{0.001, 0.01, 0.1, 0.5\}$. The results indicate that our
approach can reduce the CPU time, especially when $\varepsilon$ is large. The number of
breakpoints was also reduced, in the same way as the CPU time. In our
approach, since we need a rank-$m$ update of a matrix at each breakpoint
($1 \le m \le B$), an update at a breakpoint may take longer than the rank-one
update which is needed in the conventional solution path algorithm. We
conjecture that this is why the decrease in the number of breakpoints was
slightly faster than that of the CPU time. However, since the maximum value of
$|\mathcal{B}|$ was set as $B = 10$ in this experiment, this additional cost was relatively
small compared to the effect of the reduction of the number of breakpoints.

(c) a5a    (d) w5a

Figure 2: Log plot of CPU time and the number of breakpoints. The horizontal
axis of each plot is the degree of the approximation. The circle denotes the CPU
time (left axis) and the cross mark denotes the number of breakpoints (right
axis) of the suboptimal path. The top dashed line of each plot indicates both
the CPU time and the number of breakpoints of the exact path. The relative
scales of the left and right axes are the same.

Next, we investigated the effect of $B$. Figure 3 shows the CPU time and
the number of breakpoints for the w1a data ($n = 2477$, $p = 300$) with $B = 10$ and
$B = n$. When $B = n$, there is no upper bound on $|\mathcal{B}|$. In the left plot, when
$B = n$, we see that the CPU time is longer than in the case of $B = 10$. In this
data set, this difference in the CPU time mainly comes from the cost of the
matrix update and the QP (10), whose size is proportional to $|\mathcal{B}|$ (data not
shown). On the other hand, in the right plot, the number of breakpoints is
stable in both cases of $B = n$ and $B = 10$, and interestingly, the number itself
is almost the same in these two settings. Our results suggest that too large a
$B$ does not contribute to reducing the number of breakpoints. Although these
unstable results for $B = n$ do not always happen, we observed that it is more
stable to use $B = 10$ or $B = 100$ in several other data sets.

(a) CPU time    (b) Breakpoint

Figure 3: The comparisons for different settings of $B$ (exact path, and
suboptimal paths with $B = 10$ and $B = n$, each for $\varepsilon = 0.01$ and $\varepsilon = 0.5$).

Figure 4: Comparisons of the behavior of $\pi$.

We also compared the difference of $\pi$ between the exact solution path and
the suboptimal path in order to see the degree of approximation in terms of the
active set. For each data point $i$, we define an indicator variable in $\{0, 1\}$
which takes 1 when the data point belongs to a different set among $\mathcal{M}$, $\mathcal{O}$
and $\mathcal{I}$ in the two solution paths. Figure 4(a) plots the average over 10 runs
of the sum of these indicators divided by $n$, for $\varepsilon = 0.5$ on the a5a data
set. We see that the difference is at most about 10%. Figure 4(b) shows the size
of each index set (this plot is one of the 10 runs). Although small differences
exist, the changing patterns are similar to each other.
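The partition-mismatch measure used in this comparison can be sketched as follows. This is an illustrative sketch, assuming each path's partition at a given parameter value is stored as one label per data point; the function name `partition_mismatch` and the label encoding are hypothetical.

```python
import numpy as np

def partition_mismatch(labels_exact, labels_subopt):
    """Fraction of data points assigned to different sets in the two paths."""
    a = np.asarray(labels_exact)
    b = np.asarray(labels_subopt)
    # The indicator is 1 where the set memberships differ; average over the n points.
    return float(np.mean(a != b))
```

For example, with labels `["O", "M", "I", "M"]` on the exact path and `["O", "M", "M", "M"]` on the suboptimal path, one of four points differs and the measure is 0.25.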
Table 2: Test error rate and its standard error

data    exact path         ε = 0.5
ad      0.0326 (0.0021)    0.0328 (0.0026)
spam    0.0770 (0.0036)    0.0812 (0.0037)
a5a     0.1587 (0.0025)    0.1597 (0.0031)
w5a     0.0171 (0.0012)    0.0176 (0.0010)

Table 2 shows the results of the test error rate comparison for $\varepsilon = 0.5$. We
used 60% of the data for training, 20% for validation and 20% for testing. In
each data set, we see that the performances of our suboptimal solutions are
comparable to those of the exact solution path.



6 Conclusion 

In this paper, we have developed a suboptimal solution path algorithm which
traces the changes of solutions under relaxed optimality conditions. Our
algorithm can reduce the number of breakpoints by moving multiple indices
in $\pi$ at one breakpoint. Another interesting property of our approach is that
the suboptimal solutions exactly correspond to the optimal solutions of
problems perturbed from the original SVM optimization problems. The
experimental results demonstrate that our algorithm efficiently follows the
path and that it has similar patterns of active sets and classification
performances compared to the exact path.

Appendix

Here, we provide proofs of Theorems 1 and 2 and a simplified formulation of the
optimization problem (10).

A Proof of Theorem 1 

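Here, we provide a proof of the following theorem:

Theorem 1. Let $\pi = (\mathcal{O}, \mathcal{M}, \mathcal{I})$ be the partition at the optimal solution at $\theta$ and
assume that

\[
M = \begin{bmatrix} 0 & y_{\mathcal{M}}^\top \\ y_{\mathcal{M}} & Q_{\mathcal{M}} \end{bmatrix}
\]

is non-singular. Then, as long as $\pi$ is unchanged, the gradient $\{\beta_i\}_{i=0}^{n}$ is given by

\[
\begin{bmatrix} \beta_0 \\ \beta_{\mathcal{M}} \end{bmatrix}
= -M^{-1} \begin{bmatrix} y_{\mathcal{I}}^\top \\ Q_{\mathcal{M},\mathcal{I}} \end{bmatrix} d_{\mathcal{I}},
\qquad \beta_{\mathcal{O}} = 0,
\qquad \beta_{\mathcal{I}} = d_{\mathcal{I}}.
\tag{A.1}
\]

Proof. As long as $\pi$ is unchanged, $\alpha_i$ for $i \in \mathcal{O}$ and $i \in \mathcal{I}$ must be

\[
\alpha_i = \begin{cases} 0, & i \in \mathcal{O}, \\ c_i^{(\theta)}, & i \in \mathcal{I}. \end{cases}
\]

Therefore, we see that $\beta_{\mathcal{O}} = 0$ and $\beta_{\mathcal{I}} = d_{\mathcal{I}}$. From the definition of $\mathcal{M}$, at the
optimal solution, the following linear system holds:

\[
Q_{\mathcal{M}} \alpha_{\mathcal{M}} + Q_{\mathcal{M},\mathcal{I}}\, c_{\mathcal{I}}^{(\theta)} + y_{\mathcal{M}} \alpha_0 = \mathbf{1}.
\]

Combining this with the equality constraint of the dual problem, $y^\top \alpha = 0$, we obtain the
following linear system:

\[
M \begin{bmatrix} \alpha_0 \\ \alpha_{\mathcal{M}} \end{bmatrix}
+ \begin{bmatrix} y_{\mathcal{I}}^\top \\ Q_{\mathcal{M},\mathcal{I}} \end{bmatrix} c_{\mathcal{I}}^{(\theta)}
= \begin{bmatrix} 0 \\ \mathbf{1} \end{bmatrix}.
\]

Solving this, we obtain

\[
\begin{bmatrix} \alpha_0 \\ \alpha_{\mathcal{M}} \end{bmatrix}
= -M^{-1} \begin{bmatrix} y_{\mathcal{I}}^\top \\ Q_{\mathcal{M},\mathcal{I}} \end{bmatrix} c_{\mathcal{I}}^{(\theta)}
+ M^{-1} \begin{bmatrix} 0 \\ \mathbf{1} \end{bmatrix}.
\]

Using $c^{(\theta + \Delta\theta)} = c^{(\theta)} + \Delta\theta\, d$, we can write

\[
\begin{bmatrix} \alpha_0^{(\theta + \Delta\theta)} \\ \alpha_{\mathcal{M}}^{(\theta + \Delta\theta)} \end{bmatrix}
= \begin{bmatrix} \alpha_0^{(\theta)} \\ \alpha_{\mathcal{M}}^{(\theta)} \end{bmatrix}
- \Delta\theta\, M^{-1} \begin{bmatrix} y_{\mathcal{I}}^\top \\ Q_{\mathcal{M},\mathcal{I}} \end{bmatrix} d_{\mathcal{I}}.
\]

Then, we obtain (A.1). □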
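
The linear system of Theorem 1 can be checked numerically on a toy problem. The following is a minimal sketch, not the implementation used in the experiments: the helper names `solve` and `path_gradient`, the 4-by-4 matrix `Q`, the labels `y`, and the direction `d` are all illustrative assumptions. Exact rational arithmetic is used only so that the check is free of rounding error.

```python
from fractions import Fraction


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with exact rationals.

    Assumes A is square and non-singular (as required by Theorem 1).
    """
    n = len(A)
    aug = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Find a row with a non-zero pivot and move it into place.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


def path_gradient(Q, y, d, M_idx, I_idx):
    """Return (beta_0, beta) for a fixed partition; indices not in M or I form O."""
    n = len(y)
    # Bordered matrix [[0, y_M^T], [y_M, Q_M]].
    M_mat = [[0] + [y[j] for j in M_idx]]
    M_mat += [[y[i]] + [Q[i][j] for j in M_idx] for i in M_idx]
    # Right-hand side -[y_I^T; Q_{M,I}] d_I.
    rhs = [-sum(y[j] * d[j] for j in I_idx)]
    rhs += [-sum(Q[i][j] * d[j] for j in I_idx) for i in M_idx]
    sol = solve(M_mat, rhs)
    beta = [Fraction(0)] * n            # beta_O = 0
    for pos, i in enumerate(M_idx):
        beta[i] = sol[1 + pos]          # beta_M from the linear system
    for i in I_idx:
        beta[i] = Fraction(d[i])        # beta_I = d_I
    return sol[0], beta


# Toy data (hypothetical): Q_ij = y_i y_j K_ij for a tridiagonal kernel matrix K.
y = [1, -1, 1, -1]
Q = [[2, -1, 0, 0], [-1, 2, -1, 0], [0, -1, 2, -1], [0, 0, -1, 2]]
d = [1, 1, 1, 1]
M_idx, I_idx = [0, 1], [3]              # index 2 belongs to O
beta0, beta = path_gradient(Q, y, d, M_idx, I_idx)
```

For this toy input, the returned gradient satisfies $y^\top \beta = 0$ and $(Q\beta + y\beta_0)_i = 0$ for every $i \in \mathcal{M}$, which are the two conditions combined in the proof.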

B Proof of Theorem 2 

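Here, we provide a proof of Theorem 2. First, we prove the following lemma.

Lemma 1. Suppose $\hat{\beta} \in \mathbb{R}^n$, $\hat{\beta}_0 \in \mathbb{R}$ and $\hat{g} = Q\hat{\beta} + y\hat{\beta}_0$ satisfy the following
conditions:

\begin{align}
\hat{g}_i \hat{\beta}_i = 0, \quad \hat{g}_i \ge 0, \quad \hat{\beta}_i \ge 0, \quad & i \in \mathcal{B}_O, \tag{B.1a} \\
\hat{g}_i (d_i - \hat{\beta}_i) = 0, \quad \hat{g}_i \le 0, \quad \hat{\beta}_i \le d_i, \quad & i \in \mathcal{B}_I, \tag{B.1b} \\
\hat{g}_{\mathcal{M}_{k+1}} = 0, \quad \hat{\beta}_{\mathcal{O}_{k+1}} = 0, \quad \hat{\beta}_{\mathcal{I}_{k+1}} = d_{\mathcal{I}_{k+1}}, \quad & \tag{B.1c} \\
y^\top \hat{\beta} = 0. \quad & \tag{B.1d}
\end{align}

Then, $\hat{\beta}_0$, $\hat{\beta}$ and $\hat{g}$ are equal to $\beta_0$, $\beta$ and $g$, respectively, where $\pi$ is determined by
the update rule

\begin{align}
\mathcal{M}_{k+1} &\leftarrow \mathcal{M}_{k+1} \cup \{ i \mid \hat{\beta}_i > 0,\ \hat{g}_i = 0,\ i \in \mathcal{B}_O \}
                    \cup \{ i \mid \hat{\beta}_i < d_i,\ \hat{g}_i = 0,\ i \in \mathcal{B}_I \}, \notag \\
\mathcal{O}_{k+1} &\leftarrow \mathcal{O}_{k+1} \cup \{ i \mid \hat{\beta}_i = 0,\ \hat{g}_i > 0,\ i \in \mathcal{B}_O \}, \tag{B.2} \\
\mathcal{I}_{k+1} &\leftarrow \mathcal{I}_{k+1} \cup \{ i \mid \hat{\beta}_i = d_i,\ \hat{g}_i < 0,\ i \in \mathcal{B}_I \}, \notag
\end{align}

using $\hat{\beta}$ and $\hat{g}$.

Proof. Since the conditions (B.1a) and (B.1b) hold, all of the elements of $\mathcal{B}$ are assigned
to one of the three index sets by (B.2). From the definitions of $\mathcal{M}_{k+1}$, $\mathcal{O}_{k+1}$ and $\mathcal{I}_{k+1}$
in (B.2), we see that $\hat{g}_{\mathcal{M}_{k+1}} = 0$, $\hat{\beta}_{\mathcal{O}_{k+1}} = 0$ and $\hat{\beta}_{\mathcal{I}_{k+1}} = d_{\mathcal{I}_{k+1}}$. Using these three equations
and (B.1d), we can easily obtain the same linear system as (A.1). □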

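Next, we consider Theorem 2.

Theorem 2. Let $\hat{\beta}_0$, $\hat{\beta}$ and $\hat{g}$ be the optimal solutions of the following QP problem:

\[
\begin{aligned}
\min_{\beta_0,\, \beta,\, g} \quad & \sum_{i \in \mathcal{B}_O} g_i \beta_i + \sum_{i \in \mathcal{B}_I} g_i (\beta_i - d_i) \\
\text{s.t.} \quad & y^\top \beta = 0, \quad g = Q\beta + y\beta_0, \\
& g_{\mathcal{B}_O} \ge 0, \quad \beta_{\mathcal{B}_O} \ge 0, \quad g_{\mathcal{B}_I} \le 0, \quad \beta_{\mathcal{B}_I} \le d_{\mathcal{B}_I}, \\
& g_{\mathcal{M}} = 0, \quad \beta_{\mathcal{O}} = 0, \quad \beta_{\mathcal{I}} = d_{\mathcal{I}},
\end{aligned}
\tag{B.3}
\]

and $\pi$ is determined by (B.2) using $\hat{\beta}$ and $\hat{g}$. Then $\hat{\beta}_0$, $\hat{\beta}$ and $\hat{g}$ satisfy (B.1) and they
are equal to the gradient $\beta_0$, $\beta$ and $g$, respectively.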

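Proof. In this proof, we omit the subscript $k+1$ to simplify the notation. First, we rewrite
the optimization problem (B.3) as follows:

\[
\begin{aligned}
\min_{\beta_0,\, \beta,\, g} \quad & \sum_{i \in \mathcal{B}_O \cup \mathcal{M} \cup \mathcal{O}} g_i \beta_i + \sum_{i \in \mathcal{B}_I \cup \mathcal{I}} g_i (\beta_i - d_i) \\
\text{s.t.} \quad & g = Q\beta + y\beta_0, \quad y^\top \beta = 0, \\
& g_{\mathcal{B}_O} \ge 0, \quad \beta_{\mathcal{B}_O} \ge 0, \quad g_{\mathcal{B}_I} \le 0, \quad \beta_{\mathcal{B}_I} \le d_{\mathcal{B}_I}, \\
& g_{\mathcal{M}} = 0, \quad \beta_{\mathcal{O}} = 0, \quad \beta_{\mathcal{I}} = d_{\mathcal{I}}.
\end{aligned}
\]

Although we slightly modified the expression of the objective function, its value is the
same as that of (B.3) as long as the equality constraints hold. From the inequality constraints,
we see that the objective value is always non-negative in the feasible region.
To simplify the notation, we introduce the following new variables:

\[
\tilde{\beta} = E^{(2)} d + E\beta, \qquad \tilde{g} = Eg, \qquad \tilde{y} = Ey, \qquad \tilde{Q} = EQE,
\]

where

\[
E^{(1)}_{ij} = \begin{cases} 1 & \text{for } \{(i,j) \mid i = j,\ i \in \mathcal{B}_O \cup \mathcal{M} \cup \mathcal{O}\}, \\ 0 & \text{otherwise}, \end{cases}
\qquad
E^{(2)}_{ij} = \begin{cases} 1 & \text{for } \{(i,j) \mid i = j,\ i \in \mathcal{B}_I \cup \mathcal{I}\}, \\ 0 & \text{otherwise}, \end{cases}
\]

\[
E = E^{(1)} - E^{(2)}.
\]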

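Moreover, if we set $\mathcal{T} = \mathcal{O} \cup \mathcal{I}$, the optimization problem (B.3) is written as

\[
\begin{aligned}
\min_{\tilde{\beta},\, \beta_0,\, \tilde{g}} \quad & \tilde{\beta}^\top \tilde{Q} \tilde{\beta} + r_0 \beta_0 + r^\top \tilde{\beta} \\
\text{s.t.} \quad & \tilde{Q}\tilde{\beta} + \tilde{y}\beta_0 + r - \tilde{g} = 0, \\
& \tilde{g}_{\mathcal{M}} = 0, \quad \tilde{g}_{\mathcal{B}} \ge 0, \\
& \tilde{\beta}_{\mathcal{T}} = 0, \quad \tilde{\beta}_{\mathcal{B}} \ge 0, \\
& \tilde{y}^\top \tilde{\beta} = r_0,
\end{aligned}
\]

where

\[
r = E Q E^{(2)} d, \qquad r_0 = -y^\top E^{(2)} d
\]

are constants. Let $\xi \in \mathbb{R}^n$, $\mu_{\mathcal{M}} \in \mathbb{R}^{|\mathcal{M}|}$, $\mu_{\mathcal{B}} \in \mathbb{R}^{|\mathcal{B}|}$, $\nu_{\mathcal{T}} \in \mathbb{R}^{|\mathcal{T}|}$, $\nu_{\mathcal{B}} \in \mathbb{R}^{|\mathcal{B}|}$ and $\rho \in \mathbb{R}$ be the
Lagrange multipliers. Then, the Lagrangian is

\[
L = \tilde{\beta}^\top \tilde{Q} \tilde{\beta} + r_0 \beta_0 + r^\top \tilde{\beta}
+ \xi^\top (\tilde{Q}\tilde{\beta} + \tilde{y}\beta_0 + r - \tilde{g})
+ \mu_{\mathcal{M}}^\top \tilde{g}_{\mathcal{M}} - \mu_{\mathcal{B}}^\top \tilde{g}_{\mathcal{B}}
+ \nu_{\mathcal{T}}^\top \tilde{\beta}_{\mathcal{T}} - \nu_{\mathcal{B}}^\top \tilde{\beta}_{\mathcal{B}}
+ \rho (\tilde{y}^\top \tilde{\beta} - r_0),
\]

where $\mu_{\mathcal{B}} \ge 0$ and $\nu_{\mathcal{B}} \ge 0$. Differentiating $L$, we obtain

\begin{align*}
\frac{\partial L}{\partial \tilde{\beta}} &= 2\tilde{Q}\tilde{\beta} + r + \tilde{Q}\xi + \bar{\nu} + \rho\tilde{y} = 0, \\
\frac{\partial L}{\partial \beta_0} &= r_0 + \xi^\top \tilde{y} = 0, \\
\frac{\partial L}{\partial \tilde{g}} &= -\xi + \bar{\mu} = 0,
\end{align*}

where $\bar{\nu} \in \mathbb{R}^n$ is a vector whose components are $\bar{\nu}_{\mathcal{M}} = 0$, $\bar{\nu}_{\mathcal{B}} = -\nu_{\mathcal{B}}$, $\bar{\nu}_{\mathcal{T}} = \nu_{\mathcal{T}}$ and
$\bar{\mu} \in \mathbb{R}^n$ has $\bar{\mu}_{\mathcal{M}} = \mu_{\mathcal{M}}$, $\bar{\mu}_{\mathcal{B}} = -\mu_{\mathcal{B}}$, $\bar{\mu}_{\mathcal{T}} = 0$. Using these equations, we obtain the
following dual problem:

\begin{align}
\max \quad & -\tilde{\beta}^\top \tilde{Q} \tilde{\beta} - \rho r_0 + \xi^\top r \tag{B.4} \\
\text{s.t.} \quad & 2\tilde{Q}\tilde{\beta} + r + \tilde{Q}\xi + \bar{\nu} + \rho\tilde{y} = 0, \notag \\
& r_0 + \xi^\top \tilde{y} = 0, \notag \\
& -\xi + \bar{\mu} = 0, \tag{B.5} \\
& \bar{\nu}_{\mathcal{M}} = 0, \quad \bar{\nu}_{\mathcal{B}} \le 0, \notag \\
& \bar{\mu}_{\mathcal{T}} = 0, \quad \bar{\mu}_{\mathcal{B}} \le 0. \notag
\end{align}

Using the constraints of this problem (B.5), we can derive the following bound of the
objective function (B.4):

\[
-\tilde{\beta}^\top \tilde{Q} \tilde{\beta} - \rho r_0 + \xi^\top r
= -(\tilde{\beta} + \xi)^\top \tilde{Q} (\tilde{\beta} + \xi) - \bar{\mu}_{\mathcal{B}}^\top \bar{\nu}_{\mathcal{B}} \le 0.
\]

From this we see that the dual objective function is less than or equal to 0. Combined with
the non-negativity of the primal objective shown above, the optimal objective value of the
optimization problem is 0. Then the conditions (B.1) are satisfied. From Lemma 1, the claim
is proved. □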
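
The change of variables used in this proof can be checked numerically. The sketch below is only an illustration under stated assumptions: it takes $E$ to be the diagonal sign matrix with entries $+1$ on $\mathcal{B}_O \cup \mathcal{M} \cup \mathcal{O}$ and $-1$ on $\mathcal{B}_I \cup \mathcal{I}$, and it uses the constants $r = EQE^{(2)}d$ and $r_0 = -y^\top E^{(2)} d$, which follow from substituting the new variables into $g = Q\beta + y\beta_0$ and $y^\top\beta = 0$. The function name `flip_transform` and the toy data are hypothetical.

```python
def flip_transform(Q, y, d, beta, beta0, upper_idx):
    """Apply the sign-flip change of variables for the indices in B_I and I.

    upper_idx is the (assumed) index set B_I ∪ I; all other indices keep sign +1.
    Returns (beta_t, g_t, Q_t, y_t, r, r0).
    """
    n = len(y)
    s = [-1 if i in upper_idx else 1 for i in range(n)]        # diagonal of E
    e2d = [d[i] if i in upper_idx else 0 for i in range(n)]    # E^(2) d
    g = [sum(Q[i][j] * beta[j] for j in range(n)) + y[i] * beta0 for i in range(n)]
    beta_t = [e2d[i] + s[i] * beta[i] for i in range(n)]       # E^(2) d + E beta
    g_t = [s[i] * g[i] for i in range(n)]                      # E g
    Q_t = [[s[i] * Q[i][j] * s[j] for j in range(n)] for i in range(n)]  # E Q E
    y_t = [s[i] * y[i] for i in range(n)]                      # E y
    r = [s[i] * sum(Q[i][j] * e2d[j] for j in range(n)) for i in range(n)]  # E Q E^(2) d
    r0 = -sum(y[i] * e2d[i] for i in range(n))                 # -y^T E^(2) d
    return beta_t, g_t, Q_t, y_t, r, r0


# Toy data (hypothetical); beta and beta0 are arbitrary, not a QP solution.
y = [1, -1, 1, -1]
Q = [[2, -1, 0, 0], [-1, 2, -1, 0], [0, -1, 2, -1], [0, 0, -1, 2]]
d = [1, 2, 3, 4]
beta, beta0 = [3, -2, 5, 1], 2
upper_idx = {1, 3}
beta_t, g_t, Q_t, y_t, r, r0 = flip_transform(Q, y, d, beta, beta0, upper_idx)
```

With these definitions, $\tilde{g} = \tilde{Q}\tilde{\beta} + \tilde{y}\beta_0 + r$ holds for any $\beta$, and $\tilde{g}^\top\tilde{\beta}$ equals the rewritten objective, so every inequality constraint takes the uniform form $\tilde{g}_i \ge 0$, $\tilde{\beta}_i \ge 0$.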



C Reformulate the Optimization Problem (10) 

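We reformulate the optimization problem (10) to reduce the number of variables and
constraints. Here again, we omit the subscripts of $\mathcal{M}$, $\mathcal{O}$ and $\mathcal{I}$ to simplify the notation.
Define $\mathcal{B} = \{b_1, \ldots, b_{|\mathcal{B}|}\}$, $\mathcal{S}_O = \{ i \in \{1, \ldots, |\mathcal{B}|\} \mid b_i \in \mathcal{B}_O \}$ and
$\mathcal{S}_I = \{ i \in \{1, \ldots, |\mathcal{B}|\} \mid b_i \in \mathcal{B}_I \}$. When $|\mathcal{M}| \neq 0$, the optimization problem (10) can be
reformulated as

\[
\begin{aligned}
\min_{\beta_{\mathcal{B}}} \quad & \beta_{\mathcal{B}}^\top Q' \beta_{\mathcal{B}} + \left( r - Q'_{:,\mathcal{S}_I} d_{\mathcal{B}_I} \right)^\top \beta_{\mathcal{B}} \\
\text{s.t.} \quad & Q'_{\mathcal{S}_O,:}\, \beta_{\mathcal{B}} + r_{\mathcal{S}_O} \ge 0, \\
& Q'_{\mathcal{S}_I,:}\, \beta_{\mathcal{B}} + r_{\mathcal{S}_I} \le 0, \\
& \beta_{\mathcal{B}_O} \ge 0, \quad \beta_{\mathcal{B}_I} \le d_{\mathcal{B}_I},
\end{aligned}
\]

where

\[
Q' = Q_{\mathcal{B}} - \begin{bmatrix} y_{\mathcal{B}} & Q_{\mathcal{B},\mathcal{M}} \end{bmatrix} M^{-1}
\begin{bmatrix} y_{\mathcal{B}}^\top \\ Q_{\mathcal{M},\mathcal{B}} \end{bmatrix},
\qquad
r = -\begin{bmatrix} y_{\mathcal{B}} & Q_{\mathcal{B},\mathcal{M}} \end{bmatrix} M^{-1}
\begin{bmatrix} y_{\mathcal{I}}^\top \\ Q_{\mathcal{M},\mathcal{I}} \end{bmatrix} d_{\mathcal{I}} + Q_{\mathcal{B},\mathcal{I}}\, d_{\mathcal{I}}.
\]

On the other hand, when $|\mathcal{M}| = 0$, (10) becomes

\[
\begin{aligned}
\min_{\beta_{\mathcal{B}},\, \beta_0} \quad & \beta_{\mathcal{B}}^\top Q_{\mathcal{B}} \beta_{\mathcal{B}}
+ \left( Q_{\mathcal{B},\mathcal{I}} d_{\mathcal{I}} - Q_{\mathcal{B},\mathcal{B}_I} d_{\mathcal{B}_I} \right)^\top \beta_{\mathcal{B}}
- \left( y_{\mathcal{I}}^\top d_{\mathcal{I}} + y_{\mathcal{B}_I}^\top d_{\mathcal{B}_I} \right) \beta_0 \\
\text{s.t.} \quad & y_{\mathcal{B}}^\top \beta_{\mathcal{B}} + y_{\mathcal{I}}^\top d_{\mathcal{I}} = 0, \\
& Q_{\mathcal{B}_O,\mathcal{B}}\, \beta_{\mathcal{B}} + Q_{\mathcal{B}_O,\mathcal{I}}\, d_{\mathcal{I}} + y_{\mathcal{B}_O} \beta_0 \ge 0, \\
& Q_{\mathcal{B}_I,\mathcal{B}}\, \beta_{\mathcal{B}} + Q_{\mathcal{B}_I,\mathcal{I}}\, d_{\mathcal{I}} + y_{\mathcal{B}_I} \beta_0 \le 0, \\
& \beta_{\mathcal{B}_O} \ge 0, \quad \beta_{\mathcal{B}_I} \le d_{\mathcal{B}_I}.
\end{aligned}
\]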
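
The elimination of $\beta_0$ and $\beta_{\mathcal{M}}$ behind this reformulation can be sketched in a few lines. The code below is a hedged illustration, not the authors' implementation: it assumes the reduced quantities are $Q' = Q_{\mathcal{B}} - [y_{\mathcal{B}}\ Q_{\mathcal{B},\mathcal{M}}]\, M^{-1} [y_{\mathcal{B}}^\top; Q_{\mathcal{M},\mathcal{B}}]$ and $r = Q_{\mathcal{B},\mathcal{I}} d_{\mathcal{I}} - [y_{\mathcal{B}}\ Q_{\mathcal{B},\mathcal{M}}]\, M^{-1} [y_{\mathcal{I}}^\top; Q_{\mathcal{M},\mathcal{I}}]\, d_{\mathcal{I}}$, and it checks that $Q'\beta_{\mathcal{B}} + r$ reproduces $g_{\mathcal{B}}$ computed by explicit elimination. All function names and the toy data are hypothetical.

```python
from fractions import Fraction


def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination (A assumed non-singular)."""
    n = len(A)
    aug = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def bordered(Q, y, M_idx):
    """Matrix [[0, y_M^T], [y_M, Q_M]]."""
    return [[0] + [y[j] for j in M_idx]] + [[y[i]] + [Q[i][j] for j in M_idx] for i in M_idx]


def reduce_problem(Q, y, d, M_idx, B_idx, I_idx):
    """Return (Q_red, r) such that g_B = Q_red beta_B + r once beta_0, beta_M are eliminated."""
    M_mat = bordered(Q, y, M_idx)
    rhs_I = [sum(y[j] * d[j] for j in I_idx)] + [sum(Q[i][j] * d[j] for j in I_idx) for i in M_idx]
    sol_I = solve(M_mat, rhs_I)                                # M^{-1} [y_I^T; Q_{M,I}] d_I
    sol_B = [solve(M_mat, [y[b]] + [Q[i][b] for i in M_idx]) for b in B_idx]  # columns of M^{-1} [y_B^T; Q_{M,B}]
    left = [[y[a]] + [Q[a][j] for j in M_idx] for a in B_idx]  # rows of [y_B, Q_{B,M}]
    Q_red = [[Q[a][b] - dot(left[p], sol_B[q]) for q, b in enumerate(B_idx)]
             for p, a in enumerate(B_idx)]
    r = [sum(Q[a][j] * d[j] for j in I_idx) - dot(left[p], sol_I) for p, a in enumerate(B_idx)]
    return Q_red, r


def g_on_B_direct(Q, y, d, M_idx, B_idx, I_idx, beta_B):
    """Solve for beta_0, beta_M explicitly, then evaluate (Q beta + y beta_0) on B."""
    M_mat = bordered(Q, y, M_idx)
    rhs = [-(sum(y[b] * v for b, v in zip(B_idx, beta_B)) + sum(y[j] * d[j] for j in I_idx))]
    rhs += [-(sum(Q[i][b] * v for b, v in zip(B_idx, beta_B)) + sum(Q[i][j] * d[j] for j in I_idx))
            for i in M_idx]
    sol = solve(M_mat, rhs)
    beta0, beta_M = sol[0], sol[1:]
    return [sum(Q[a][b] * v for b, v in zip(B_idx, beta_B))
            + sum(Q[a][m] * v for m, v in zip(M_idx, beta_M))
            + sum(Q[a][j] * d[j] for j in I_idx) + y[a] * beta0 for a in B_idx]


# Toy data (hypothetical): five points, tridiagonal Q, O is empty.
y = [1, -1, 1, -1, 1]
Q = [[2 if i == j else (-1 if abs(i - j) == 1 else 0) for j in range(5)] for i in range(5)]
d = [1, 1, 1, 1, 1]
M_idx, B_idx, I_idx = [0, 1], [2, 3], [4]
beta_B = [Fraction(1, 3), Fraction(2)]
Q_red, r = reduce_problem(Q, y, d, M_idx, B_idx, I_idx)
reduced = [dot(Q_red[p], beta_B) + r[p] for p in range(len(B_idx))]
direct = g_on_B_direct(Q, y, d, M_idx, B_idx, I_idx, beta_B)
```

Because only $\beta_{\mathcal{B}}$ remains as a variable, the reduced QP has $|\mathcal{B}|$ unknowns regardless of $n$, which is the saving this section aims for.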